feat: add dynamic shapes kernel specialization strategy for TRT-RTX #4184
Merged
lanluo-nvidia merged 2 commits into pytorch:main on Apr 21, 2026
Conversation
tp5uiuc commented Apr 12, 2026
c222c72 to 385eec6
Expose IRuntimeConfig.setDynamicShapesKernelSpecializationStrategy()
through the Torch-TensorRT Python API. Users can now control how
shape-specialized kernels are compiled at runtime for dynamic shapes
on TensorRT-RTX via the new `dynamic_shapes_kernel_specialization_strategy`
compilation setting ("lazy", "eager", or "none").
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review feedback: compile with torchtrt.Input min/opt/max ranges so dynamic shapes are actually exercised. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
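For reference, a minimal usage sketch of what the two commits above describe: compiling with `torchtrt.Input` min/opt/max ranges so dynamic shapes are actually exercised, and selecting the new strategy. The toy module, shapes, and the `ir="dynamo"` path are illustrative assumptions, not code from this PR.

```python
import torch
import torch_tensorrt as torchtrt

# Placeholder module; any CUDA-resident eval-mode model would do here.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).cuda().eval()

# Dynamic batch dimension: the engine must serve any batch size in [1, 64].
inputs = [
    torchtrt.Input(
        min_shape=(1, 128),
        opt_shape=(16, 128),
        max_shape=(64, 128),
        dtype=torch.float32,
    )
]

trt_model = torchtrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    # New setting from this PR; honored on TensorRT-RTX builds.
    dynamic_shapes_kernel_specialization_strategy="lazy",  # or "eager" / "none"
)

# Varying the batch size at runtime hits the dynamic-shape path; under "lazy",
# shape-specialized kernels compile in the background while fallback kernels
# serve the first calls.
for bs in (1, 16, 64):
    trt_model(torch.randn(bs, 128).cuda())
```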
385eec6 to d7619ca
lanluo-nvidia approved these changes Apr 20, 2026
Collaborator lanluo-nvidia left a comment:
lgtm, one minor comment.
hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation. Not used for TensorRT-RTX.
runtime_cache_path (str): Path to the runtime cache for TensorRT-RTX JIT compilation results. Not used for standard TensorRT.
dynamic_shapes_kernel_specialization_strategy (str): Strategy for dynamic shape kernel specialization at runtime (TensorRT-RTX only). Options: "lazy", "eager", "none". Default: "lazy".
Collaborator
Can we add a warning or check in case the user configures `dynamic_shapes_kernel_specialization_strategy` on standard TensorRT?
Contributor
Author
This is a good suggestion, Lan. I have a follow-up task to emit user warnings for:
- timing cache used in TRT-RTX
- runtime cache used in standard TRT
- dynamic shape strategy used in standard TRT
- cudagraphs flag used in standard TRT
so that it's easier to review the change in behavior. I will put the warnings in then.
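For illustration only, a hedged sketch of what such a warning check might look like; the helper name, the exact conditions, and where it would hook into compilation are assumptions about the follow-up, not code from it.

```python
import logging
from typing import Optional

from torch_tensorrt._features import ENABLED_FEATURES

logger = logging.getLogger(__name__)


def warn_on_backend_mismatched_settings(
    dynamic_shapes_kernel_specialization_strategy: str = "lazy",
    runtime_cache_path: Optional[str] = None,
    timing_cache_path: Optional[str] = None,
) -> None:
    """Hypothetical helper: warn when a setting honored by only one backend is set on the other."""
    if ENABLED_FEATURES.tensorrt_rtx:
        if timing_cache_path is not None:
            logger.warning("timing_cache_path is not used by TensorRT-RTX and will be ignored.")
    else:
        if dynamic_shapes_kernel_specialization_strategy != "lazy":
            logger.warning(
                "dynamic_shapes_kernel_specialization_strategy is a TensorRT-RTX-only "
                "setting and will be ignored by standard TensorRT."
            )
        if runtime_cache_path is not None:
            logger.warning(
                "runtime_cache_path is a TensorRT-RTX-only setting and will be ignored "
                "by standard TensorRT."
            )
```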
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request on Apr 22, 2026
Address the structural PR feedback by extracting TensorRT-RTX-specific
IRuntimeConfig state into its own type and collapsing the per-feature
appliers that previously scattered `#ifdef TRT_MAJOR_RTX` through
TRTEngine.
What
- New core/runtime/TRTRuntimeConfig.{h,cpp} owns the IRuntimeConfig
shared_ptr plus (on TRT-RTX) the IRuntimeCache, runtime-cache path,
dynamic shapes kernel strategy, CUDA graph strategy, and the
rtx_native_cudagraphs_disabled one-shot flag. All per-feature
appliers live there as public members and are no-ops on non-RTX
builds, keeping the only `#ifdef TRT_MAJOR_RTX` scatter contained
in this new file.
- Strategy fields are now strongly-typed enums
(`DynamicShapesKernelStrategy`, `CudaGraphStrategyOption`) with
matching `to_string`/`to_int` helpers, validated at engine
construction via `to_dynamic_shapes_kernel_strategy` /
`to_cuda_graph_strategy_option` rather than raw int ranges.
- `TRTEngine::recreate_execution_context` is now backend-agnostic:
it calls `runtime_cfg.ensure_initialized`, applies the allocation
strategy, and creates the execution context via
`createExecutionContext(IRuntimeConfig*)`. Both standard TensorRT
and TRT-RTX go through this uniform path; only the three RTX-only
setters (`setRuntimeCache`, `setDynamicShapesKernelSpecializationStrategy`,
`setCudaGraphStrategy`) stay behind an
`#ifdef TRT_MAJOR_RTX` guard inside the struct.
- `~TRTEngine` now wraps cleanup in try/catch and delegates cache
persistence to `TRTRuntimeConfig::save_runtime_cache_nothrow`, so
stack unwinding can no longer propagate a cache-save failure out
of the destructor.
- `save_runtime_cache_nothrow` uses `std::filesystem` + atomic
`tmp+rename` only; file locking is out of scope for this PR and
will be introduced in a follow-up once we pick a portable
mechanism.
- `is_monolithic_capturable` asserts `exec_ctx` is non-null; the
three RTX-only appliers `TORCHTRT_ASSERT` that `config` is live
before dereferencing.
- `disable_rtx_native_cudagraphs` persists the runtime cache before
flipping the strategy so any kernels compiled under the internal
capture survive to the next reload.
- `TRTEngine::to_str` now emits human-readable strategy names (via
`to_string(enum)`) instead of integer codes.
- New serialization indices (`RUNTIME_CACHE_PATH_IDX`,
`DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX`, `CUDA_GRAPH_STRATEGY_IDX`) are now
`#ifdef TRT_MAJOR_RTX`-gated in runtime.h, register_jit_hooks.cpp,
the FlattenedState tuple, the serialize/deserialize constructors,
and `__obj_flatten__`. Standard TRT builds keep `SERIALIZATION_LEN
== 11` so engines serialized there do not carry RTX-only slots.
- Python `_TorchTensorRTModule` reads the RTX-only index accessors and
  writes the RTX-only engine-info slots only when
  `ENABLED_FEATURES.tensorrt_rtx` is true, so standard TRT users see no
  new behavior at runtime (a sketch of this conditional population
  follows this commit message).
- Deduplicated `_compiler.py` arguments after rebase on upstream
main where PR pytorch#4184 had already added
`dynamic_shapes_kernel_specialization_strategy`. Kept one copy of
each arg; `cuda_graph_strategy` is threaded through all three
compile() entry points.
Build + tests
- RTX build on A100 / L40S: libtorchtrt.so and libtorchtrt_runtime.so
  link clean, no `#ifdef` diagnostics. Pre-commit checks
pass (clang-format, black, isort, ruff, mypy, typos, buildifier).
- All 35 runtime-cache/strategy tests pass; regression across
test_000_runtime_cache.py (Python runtime), test_002_cudagraphs_cpp.py,
and test_005_dynamic_allocation.py is green.
Addresses review comments on PR pytorch#4202:
- Guarding of new IDX entries and Python accessors on
TRT_MAJOR_RTX / ENABLED_FEATURES.tensorrt_rtx.
- Encapsulation of RTX-specific state in a dedicated type with
enumerated strategies and transparent standard-TRT/RTX behavior.
- Destructor exception safety.
- Unification of the execution-context creation path via
IRuntimeConfig.
- Removal of file locking for runtime-cache persistence.
- Debug asserts before dereferencing the live IRuntimeConfig.
- Human-readable to_str output.
- save_runtime_cache invoked from disable_rtx_native_cudagraphs.
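As noted in the `_TorchTensorRTModule` bullet above, here is a hedged Python sketch of the conditional engine-info population; the function and argument names and the list layout are assumptions, and only the eleven shared slots and the RTX-only gating come from the commit message.

```python
from typing import Any, List

from torch_tensorrt._features import ENABLED_FEATURES

# Slots shared by standard TensorRT and TRT-RTX engine info
# (SERIALIZATION_LEN == 11 in the commit message above).
SHARED_SERIALIZATION_LEN = 11


def build_engine_info(
    base_slots: List[Any],
    runtime_cache_path: str,
    dynamic_shapes_kernel_strategy: str,
    cuda_graph_strategy: str,
) -> List[Any]:
    """Hypothetical helper mirroring the described behavior, not the actual module code."""
    info = list(base_slots)
    assert len(info) == SHARED_SERIALIZATION_LEN
    if ENABLED_FEATURES.tensorrt_rtx:
        # RTX-only slots are appended only on RTX builds, so engines produced
        # by standard TensorRT keep the eleven-slot layout.
        info += [runtime_cache_path, dynamic_shapes_kernel_strategy, cuda_graph_strategy]
    return info
```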
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request on Apr 22, 2026
Address PR review comments that asked that the new C++ runtime tests be folded into existing feature-level files rather than shipped as parallel `*_cpp.py` files.
What
- Merge `test_000_runtime_cache_cpp.py` into the existing `test_000_runtime_cache.py`. The file already covered the Python runtime path; two new classes (`TestRuntimeCacheCppPersistence`, `TestCppSerializationIndices`) cover the C++ runtime path via `use_python_runtime=False` and the serialization-index assertions. Skip on non-RTX builds.
- Fold the C++ runtime cases for the dynamic shapes kernel specialization strategy into `test_001_dynamic_shapes_kernel_strategy.py` (introduced upstream in PR pytorch#4184). Two new classes (`TestDynamicShapesKernelStrategyCpp`, `TestDynamicShapesKernelStrategyCppInvalidValue`) exercise lazy/eager/none end-to-end and reject invalid strategy names. The pre-existing Python runtime tests remain untouched.
- Rename `test_000_cuda_graph_strategy.py` to `test_001_cuda_graph_strategy.py` to match the `test_001_*` convention used for L1 RTX-only features. When upstream lands the Python runtime counterpart (PR pytorch#4187), both sets fold into the same file.
- Add model-level tests: `test_runtime_cache_models.py` gains a `TestRuntimeCacheCppModels` class exercising ResNet18 through the C++ runtime with a warm-cache roundtrip. `test_dynamic_shapes_kernel_strategy_models.py` gains `TestDynamicShapesKernelStrategyCppModels` covering lazy/eager/none on ResNet18 via the C++ runtime.
Verified
- 35 passed / 3 skipped in the runtime/ tests (merged file plus the test_001 strategy files).
- No regression in test_002_cudagraphs_cpp.py (8 passed) or test_005_dynamic_allocation.py (1 passed).
Addresses PR pytorch#4202 review comments asking for test file merges and the addition of model-level runtime_cache_models.py / dynamic_shapes_kernel_strategy_models.py coverage.
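A loose sketch (not the folded test code itself) of the structure described above: C++-runtime coverage gated on RTX builds plus rejection of an invalid strategy name. The module, input shape, and exception type are assumptions.

```python
import unittest

import torch
import torch_tensorrt as torchtrt
from torch_tensorrt._features import ENABLED_FEATURES


def _compile(strategy: str):
    # Tiny placeholder model; the real tests use larger modules and ResNet18.
    model = torch.nn.Linear(32, 16).cuda().eval()
    return torchtrt.compile(
        model,
        ir="dynamo",
        inputs=[torchtrt.Input((8, 32))],
        use_python_runtime=False,  # exercise the C++ runtime path
        dynamic_shapes_kernel_specialization_strategy=strategy,
    )


@unittest.skipUnless(ENABLED_FEATURES.tensorrt_rtx, "TensorRT-RTX only")
class TestDynamicShapesKernelStrategyCppSketch(unittest.TestCase):
    def test_valid_strategies(self):
        for strategy in ("lazy", "eager", "none"):
            mod = _compile(strategy)
            self.assertIsNotNone(mod(torch.randn(8, 32).cuda()))

    def test_invalid_strategy_rejected(self):
        # The exact exception type raised for a bad strategy name is an assumption.
        with self.assertRaises(Exception):
            _compile("sometimes")
```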
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request on Apr 22, 2026
Follow-up to 54f9ccd / 1fa8c82 addressing the second batch of PR pytorch#4202 review feedback. Pure refactor with no user-visible behavior change; all tests green on A100 (35 passed / 3 skipped + 9 regression passed).
TRTEngine
- Constructor signature simplified: the three separate `runtime_cache_path` / `dynamic_shapes_kernel_strategy` / `cuda_graph_strategy` parameters collapsed into a single `TRTRuntimeConfig runtime_cfg` sink parameter. The forwarding ctor std::moves it into the primary ctor, which std::moves it into the member.
- String sink parameters (mod_name, serialized_engine, serialized_metadata) taken by value and moved into members / slugify.
- Deserialization constructor routes through the new free function make_runtime_config_from_serialized, which internalizes the TRT_MAJOR_RTX-gated index reads so the constructor itself stays unguarded.
- FlattenedState uses a single TRTRTX_FLATTENED_STATE_EXTRAS macro for the three RTX-only tuple entries instead of duplicating the first eleven entries across two branches.
- Destructor restored to the pre-refactor structure: torch::cuda::synchronize runs outside a try block and runtime_cfg.save_runtime_cache (now noexcept by signature) is called directly. Exception safety is guaranteed by the member's type, not by a defensive try/catch.
- __obj_flatten__ and serialize cast enum values via std::underlying_type_t<...> instead of int so serialization stays in lockstep with any future underlying-type change on the enums.
TRTRuntimeConfig
- Conversion helpers take std::underlying_type_t<Enum> (the declared 32-bit integer type) instead of raw int. Callers at serialization boundaries explicitly std::stoi / static_cast into the right type.
- [[nodiscard]] added to to_string, to_dynamic_shapes_kernel_strategy, to_cuda_graph_strategy_option, uses_internal_capture, is_monolithic_capturable, to_str, and make_runtime_config_from_serialized.
- to_string default cases now TORCHTRT_CHECK(false, ...) with the unexpected integer value; std::unreachable is C++23.
- set_execution_context_allocation_strategy is now const.
- Cache I/O split into two layers:
  - Free functions load_runtime_cache(path, cache) and save_runtime_cache(path, cache) perform the raw std::filesystem I/O and use TORCHTRT_CHECK on failure -- exception-propagating, easier to test in isolation.
  - Member TRTRuntimeConfig::save_runtime_cache() is a noexcept wrapper that calls the free function and swallows exceptions via try/catch -- safe from a destructor. The _nothrow suffix is dropped from the member name (the signature now carries that contract).
- write_to_str(ostream&) replaced by two functions: a const-correct to_str() -> std::string, and a free operator<<(ostream&, const TRTRuntimeConfig&) that wraps it with "Runtime cfg { ... }" delimiters. TRTEngine::to_str streams the config via the free operator.
Python
- _settings.py: removed a duplicated dynamic_shapes_kernel_specialization_strategy field and its duplicated docstring left over from the upstream rebase of PR pytorch#4184 into our changes.
Covers review comments 3126538200, 3126541782, 3126547529, 3126549147, 3126682329, 3126683329, 3126693226, 3126715369, 3126725953, 3126736626, 3126738422, 3126745230, 3126747553, 3126749405, 3126764831, 3126772536, 3126786564, 3126803652, 3126816780, 3126818065, 3126818561, 3126819429, 3126823781, 3126840987, 3126846827.
Description
Expose `IRuntimeConfig.setDynamicShapesKernelSpecializationStrategy()` through the Torch-TensorRT Python API for TensorRT-RTX builds. Users can now control how shape-specialized kernels are compiled at runtime for dynamic shapes via the new `dynamic_shapes_kernel_specialization_strategy` compilation setting:
- "lazy" (default): compile shape-specialized kernels in the background, use fallback kernels until they are ready
- "eager": compile immediately (blocking)
- "none": always use fallback kernels, never specialize

Depends on: #4180 (runtime cache API, which provides the `IRuntimeConfig` infrastructure)
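A short, hedged sketch of guarding the setting on RTX-enabled builds; `ENABLED_FEATURES.tensorrt_rtx` is the feature flag referenced in the follow-up commits above, and the module and input here are placeholders.

```python
import torch
import torch_tensorrt as torchtrt
from torch_tensorrt._features import ENABLED_FEATURES

model = torch.nn.Linear(16, 8).cuda().eval()  # placeholder model
inputs = [torchtrt.Input((4, 16))]

settings = {}
if ENABLED_FEATURES.tensorrt_rtx:
    # Only meaningful on TensorRT-RTX builds; standard TensorRT ignores it.
    settings["dynamic_shapes_kernel_specialization_strategy"] = "eager"

trt_model = torchtrt.compile(model, ir="dynamo", inputs=inputs, **settings)
```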
IRuntimeConfiginfrastructure)Type of change
Checklist: